Statistical thinking — statistical inference
2024-09-27

adelie_shuffled <- withr::with_seed(2,
  adelie_bills |>
    specify(bill_length_mm ~ sex) |>
    hypothesize(null = "independence") |>
    infer::generate(reps = 1000, type = "permute")
)
Statistical inference uses probabilities to make decisions about parameters of interest in a population from statistics (estimates) derived from a subset, or sample, of that population
Are the population means of two groups, control & treated, different?
Key concepts
Interested in the efficacy of a new drug for treating disease in animals
The population of interest is all the animals with the disease across herds in a sector
We could recruit the farmers managing all the herds in the sector into a study, and give each animal either the new drug or a control
Measure the recovery time of the two groups
Problems are:
An alternative approach is to sample from the population of interest
Take a random, representative subset of animals (or herds) from the population of interest
How representative the sample is depends on how it was collected
Many statistical methods assume that the data represent a random sample from a population
Difficult to do in practice
Alternatively we could take a cluster sample (a two-stage design): select \(n\) representative herds and then randomly sample individuals from within those herds
Statistics computed on samples will rarely be exactly equal to the population values
Is there evidence of sexual dimorphism in Adelie penguin bill lengths?
A distinction is often made between parameters and statistics
A sample statistic is an estimate of a population parameter
The terms sample statistic and parameter estimate are often used interchangeably
Probabilities describe the chance that an event will occur
Probabilities range from 0 to 1, with smaller values indicating the event is unlikely to occur and larger values indicating it is likely
Compute probabilities as \(\displaystyle \frac{m}{M}\), where \(m\) is the number of possible outcomes of interest and \(M\) is the number of all possible outcomes
Probability of a head on a single coin toss: \(\frac{1}{2} = 0.5\)
Probability of rolling an odd number on a 10-sided die: \(\frac{5}{10} = 0.5\)
Probability of rolling a value less than 3 on a 10-sided die: \(\frac{2}{10} = 0.2\)
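The \(m/M\) calculation maps directly onto R; a quick sketch for the 10-sided die examples:

```r
# probability as m / M for a 10-sided die
outcomes <- 1:10                  # M = all possible outcomes

m_odd <- sum(outcomes %% 2 == 1)  # outcomes of interest: 1, 3, 5, 7, 9
m_odd / length(outcomes)          # 0.5

m_lt3 <- sum(outcomes < 3)        # outcomes of interest: 1, 2
m_lt3 / length(outcomes)          # 0.2
```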
What is the probability of randomly selecting two samples with means as different as observed if the samples came from the same population?
Need to consider the sampling error:
Earlier we saw a difference of means of the bill lengths of male and female Adelie penguins of 3.18
How likely is a difference of means of >3? How likely is a difference of means of >0.5?
Read off from the cumulative distribution function of the sampling distribution of statistic
Often, the null hypothesis, \(\text{H}_0\), is that there is no effect
\(\text{H}_0\): there is no effect of sex on the bill length of Adelie penguins
This means that, under \(\text{H}_0\) we can rearrange the sex variable randomly, pairing any value of bill_length_mm with a male or female penguin
Under \(\text{H}_0\), these shufflings of the data should look similar to our observed data
Do the shuffled data look like the observed data?
What we just did is the basis of a permutation test of \(\text{H}_0\)
A real test would shuffle — permute — the data many more times than 4
Response: bill_length_mm (numeric)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 40,000 × 3
# Groups: replicate [1,000]
bill_length_mm sex replicate
<dbl> <fct> <int>
1 41.1 female 1
2 39.6 female 1
3 36.9 female 1
4 40.1 female 1
5 42.7 female 1
6 38.1 female 1
7 36.7 female 1
8 40.6 female 1
9 39.2 female 1
10 36.6 female 1
# ℹ 39,990 more rows
With the infer 📦 we use
specify() to say which variables are involved in the test, and which is the response and which is the explanatory variable
hypothesize() to indicate what we are testing — here we are testing the independence of bill_length_mm and sex
generate() to create 1,000 permutations of the data, which are adelie_shuffled
We also wrapped all that in withr::with_seed(), which is another way of setting the random seed for the RNG
Now we have 1,000 randomly shuffled — permuted — data sets
Our null hypothesis (\(\text{H}_0\)) is that there is
no difference in the mean bill length of male and female Adelie penguins
For each permuted data set we need to calculate the mean
bill_length_mm for the males — \(\hat{\mu}_{\text{Males}}\)
bill_length_mm for the females — \(\hat{\mu}_{\text{Females}}\)
The difference of these means will be our test statistic
We could do those calculations using dplyr but infer makes it even easier via calculate()
"diff in means" is how we indicate we want a difference of meansorder indicates the order of the means in the differenceThe order doesn’t matter, so long as we are consistent
We did \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\) so we will put male first
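The calculate() step can be sketched as follows; note that toy_bills below is an invented stand-in for adelie_bills (only the column names match the real data), so the resulting values are illustrative only:

```r
library(infer)

# toy stand-in for adelie_bills (invented values, same column names)
set.seed(2)
toy_bills <- data.frame(
  bill_length_mm = c(rnorm(20, mean = 40.5, sd = 1.7),   # "male" values
                     rnorm(20, mean = 37.4, sd = 1.3)),  # "female" values
  sex = factor(rep(c("male", "female"), each = 20))
)

null_dist <- toy_bills |>
  specify(bill_length_mm ~ sex) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("male", "female"))

nrow(null_dist)  # one difference of means per permuted data set: 1000
```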
This generates the null distribution of the test statistic
The null distribution shows the distribution of the test statistic under \(\text{H}_0\) of no effect
We also need to compute the observed value of the test statistic (i.e. the one for our data)
Do this using the same pipeline, without the hypothesize() and generate() steps
If we add the observed value to the plot we can see how typical our observed difference of means is relative to the values we would expect if there were no effect of sex
Whether a test is one- or two-tailed is a common distinction in research studies
Relates to the alternative hypothesis:
This is where concept of one- or two-tailed tests comes from; in which tail do we find the rejection region
If we add the observed value to the plot we can see how typical our observed difference of means is relative to the values we would expect if there were no effect of sex
We asked for direction = "both" so the alternative hypothesis is that the males and females have different mean bill lengths
We typically denote the alternative hypothesis as \(\text{H}_1\) or \(\text{H}_{\text{a}}\)
This hypothesis is usually what we are interested in knowing about, but we can’t test it explicitly
\(\text{H}_{\text{a}}\) could be:
1. male and female Adelie penguins have different mean bill lengths
2. male Adelie penguins have longer bills, on average, than female penguins
3. male Adelie penguins have shorter bills, on average, than female penguins
Option 1 is a two-sided test so we set direction to be "two-sided" or "both"
Option 2 is a one-sided test so we set direction to be "greater" or "right"
Option 3 is also a one-sided test but we set direction to be "less" or "left"
(for order = c("male", "female")!)
Before observing the penguins, if we had hypothesized that males had longer bills than females, we could have done a one-sided test of the hypothesis
male Adelie penguins have longer bill lengths on average than females
\[\begin{align*} \text{H}_{\text{0}} & : \hat{\mu}_{\text{Males}} = \hat{\mu}_{\text{Females}} \\ \text{H}_{\text{a}} & : \hat{\mu}_{\text{Males}} > \hat{\mu}_{\text{Females}} \end{align*}\]
It might be easier to express our hypotheses in terms of the difference of means
\[\begin{align*} \text{H}_{\text{0}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} \neq 0 \end{align*}\]
"less"\[\begin{align*} \text{H}_{\text{0}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} < 0 \end{align*}\]
"greater"\[\begin{align*} \text{H}_{\text{0}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} = 0 \\ \text{vs H}_{\text{a}} & : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} > 0 \end{align*}\]
You must decide what kind of \(\text{H}_{\text{a}}\) you want to test before you observe (or analyse) the data
You can’t use a two-sided alternative \[ \text{H}_{\text{a}} : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} \neq 0 \]
and reject \(\text{H}_{\text{0}}\), noting that males have longer bills than females and then decide to use the more powerful alternative
\[ \text{H}_{\text{a}} : \hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}} > 0 \]
Using the null distribution we can obtain a p-value
the probability that a test statistic as or more extreme than the observed statistic would occur if the null hypothesis were true
We count how many differences of means in the null distribution are as large as, or larger than (in absolute value),
the observed difference of means \(\hat{\mu}_{\text{Males}} - \hat{\mu}_{\text{Females}}\)
Why “in absolute value” for the two-sided test?
In our case, the p-value is reported to be 0
# p value for the simpler alternative that the bill lengths are different
adelie_null_distribution |>
  get_p_value(obs_stat = obs_diff_means, direction = "two-sided")
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
# p value for the stronger alternative that male bill lengths are larger
adelie_null_distribution |>
  get_p_value(obs_stat = obs_diff_means, direction = "greater")
# A tibble: 1 × 1
  p_value
    <dbl>
1       0
A p-value of exactly 0 doesn’t make sense, so instead we say that \(p\) is less than the smallest value we could have resolved
\[\text{p value} = p = \frac{1}{P}\]
where \(P\) is the number of permuted data sets we generated (1,000 here), i.e. \(p < 0.001\)
This is the basis of Null Hypothesis Significance Testing or NHST
The null hypothesis is the situation we test, e.g. no difference in means between groups, no relationship between x and y
We look for evidence against this null hypothesis: values of statistics that would be unlikely if the null hypothesis were true
We permuted the data to generate the null distribution of the test statistic
the sampling distribution of the test statistic under \(\text{H}_0\)
Much theoretical work in statistics is in deriving sampling distributions of different statistics, or types of statistics, using maths instead of resampling (permuting)
The standard normal distribution is a Gaussian distribution with \(\mathsf{\mu = 0}\) & \(\mathsf{\sigma^2 = 1}\)
People often call values from this distribution z-scores
Calculate z-scores by subtracting the mean score from each score and dividing by \(\sigma\)
\[z_i = \mathsf{\frac{\text{score}_i - \text{mean}}{\text{standard deviation}}}\]
\[z_i = \mathsf{\frac{14 - \text{14.95}}{\text{5.47}}} = \mathsf{-0.17}\]
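That arithmetic in R, using the mean (14.95) and standard deviation (5.47) quoted above:

```r
# z-score: how many standard deviations a score lies from the mean
z_score <- function(score, mean, sd) {
  (score - mean) / sd
}

round(z_score(14, mean = 14.95, sd = 5.47), 2)  # -0.17
```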
Negative z-scores lie below the sample mean, positive scores above it
A z-score of -22.52 tells us that the score of 2mm is 22.52 standard deviations below the mean
68% probability a z-score lies between -1 and 1; 95% probability a z-score lies between -1.96 and 1.96
Probability distributions, like the standard normal or t distribution, help us compute the probability of obtaining a statistic if the null hypothesis were true
This is the probability due to chance
Under the standard normal, the probability of observing a score of 2.5 or larger is 0.006
Under a t distribution with 5 df, the probability of observing a score of 2.5 or larger is 0.027
The t distribution is often used for small samples & where the sample variance is estimated (not known), which for many samples will be an underestimate of the true variance
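These tail probabilities can be read off with base R’s pnorm() and pt():

```r
# upper-tail probabilities from the cumulative distribution functions
pnorm(2.5, lower.tail = FALSE)       # P(Z > 2.5) under the standard normal, ~0.006
pt(2.5, df = 5, lower.tail = FALSE)  # P(T > 2.5) under t with 5 df, ~0.027

# the t distribution has heavier tails, so extreme scores are more probable
```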
The p-value tells us nothing about the probability of the null hypothesis or the alternative hypothesis
Redrawn from Krzywinski & Altman (2013) Significance, P values and t-tests. Nature Methods 10(11) 1041–1042 doi: 10.1038/nmeth.2698
Whether a test is one- or two-tailed is a common distinction in research studies
Relates to the alternative hypothesis:
This is where concept of one- or two-tailed tests comes from; in which tail do we find the rejection region
The classical test for a difference of means is known as a two-sample t test
Two Sample t-test
data: bill_length_mm by sex
t = -5.4197, df = 38, p-value = 3.557e-06
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
-4.374686 -1.995314
sample estimates:
mean in group female mean in group male
37.355 40.540
The two-sample t test is a special case of a linear model
Call:
lm(formula = bill_length_mm ~ sex, data = adelie_bills)
Residuals:
Min 1Q Median 3Q Max
-4.2400 -1.0262 0.0025 0.8350 4.8450
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.3550 0.4155 89.89 < 2e-16 ***
sexmale 3.1850 0.5877 5.42 3.56e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.858 on 38 degrees of freedom
Multiple R-squared: 0.436, Adjusted R-squared: 0.4211
F-statistic: 29.37 on 1 and 38 DF, p-value: 3.557e-06
The two-sample t test is a special case of a linear model
 Term                     Contrast Estimate Std. Error    z Pr(>|z|)    S 2.5 % 97.5 %
  sex mean(male) - mean(female)       3.19      0.588 5.42   <0.001 24.0  2.03   4.34

Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Type: response
We will focus on linear models in this course
In NHST we obtain the probability (\(p\)) of observing as extreme a statistic as the observed if the null hypothesis were true: evidence against the null
If \(p\) is small we reject the null hypothesis & accept the alternative; there is a chance that we falsely reject the null (a Type I error)
If \(p\) is large we fail to reject the null hypothesis & thus reject the alternative; there is a chance that we falsely fail to reject the null (a Type II error)
Randall Munroe http://xkcd.com/1478/
A major problem with people’s use of NHST is the over-emphasis put on the p-value
The p-value is just a measure of conditional probability; just because you get a low p-value doesn’t mean the effect is important
Increase the sample size and, all things equal, you can detect smaller effects (statistically significant effects)
Important to ask whether the effect size is important or relevant
Effect size refers to the magnitude of the effect of interest, e.g. the size of the difference in means between the groups
Statistical power is the ability of a study to detect a real effect
Power is a function of the Type II error rate, \(\beta\):
\[\mathsf{power} = 1 - \beta\]
Power ranges from 0 (no chance of detecting the real effect) to 1 (100% chance of detecting the real effect)
There is no point in conducting a study, especially one involving animals, that has low chance of finding the real effect
Power is a function of several important factors
For simple models (t-tests, one-way ANOVA, etc) there are relatively simple equations for computing the power of a planned study for given sample size, effect size etc.
For more complex models, power calculations become difficult; simulation using a computer is a solution
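For the two-sample t-test case, base R’s power.t.test() implements those simple equations; a sketch plugging in numbers like the Adelie ones from earlier (the exact inputs here are illustrative):

```r
# power of a two-sample t-test, plugging in numbers like the Adelie ones:
# n = 20 per group, difference in means ~3.18 mm, residual SD ~1.86
pwr <- power.t.test(n = 20, delta = 3.18, sd = 1.86, sig.level = 0.05,
                    type = "two.sample", alternative = "two.sided")
pwr$power  # essentially 1: an effect this large is very easy to detect

# the flip side: n per group needed to reach 80% power
power.t.test(delta = 3.18, sd = 1.86, sig.level = 0.05, power = 0.8)$n
```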
In NHST, too much emphasis is placed on just the sample statistic (mean, difference of means, regression slope \(\beta_j\), etc)
A different way of presenting statistical results is the confidence interval
Confidence intervals can be interpreted as (from Wikipedia)
As with the resampling (permutation) and theoretical approaches we used for hypothesis testing we can compute a resampling or theoretical version of a confidence interval
The resampling version uses a non-parametric bootstrap procedure
The standard error of the mean (SEM)
the estimated standard deviation of the means of all possible samples of the variable
Rather than use this theoretical result we could resample the data to generate the sampling distribution
Instead of permuting the data (resampling without replacement) we’ll use resampling with replacement
With replacement means that each time we draw a single data point to add to our sample, we put it back so it can be drawn again
A single bootstrap sample then is the same size as our data
But it will include repeats of some of the observed values
This also means some values in the data won’t be in a bootstrap sample
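In base R this is sample() with replace = TRUE; a sketch on made-up data:

```r
# resampling with replacement: one bootstrap sample from made-up data
set.seed(1)
x <- 1:10  # stand-in data set

boot <- sample(x, size = length(x), replace = TRUE)  # same size as the data
sort(boot)

any(duplicated(boot))  # usually TRUE: some values were drawn more than once
setdiff(x, boot)       # values that didn't make it into this sample
```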
A single bootstrap sample can be generated using rep_sample_n()
Set replace = TRUE to get a bootstrap sample instead of a permutation
# A tibble: 40 × 3
# Groups: replicate [1]
replicate bill_length_mm sex
<int> <dbl> <fct>
1 1 36 female
2 1 39.1 male
3 1 41.3 male
4 1 38.2 male
5 1 40.6 male
6 1 40.6 male
7 1 40.8 male
8 1 38.8 male
9 1 33.5 female
10 1 37.9 female
# ℹ 30 more rows
To generate multiple samples, set reps
# A tibble: 840 × 3
# Groups: replicate [21]
replicate bill_length_mm sex
<int> <dbl> <fct>
1 1 39.6 female
2 1 37.3 female
3 1 42.2 female
4 1 35.9 female
5 1 41.6 male
6 1 39.7 female
7 1 43.2 male
8 1 41.3 male
9 1 36.3 male
10 1 43.2 male
# ℹ 830 more rows
Can use the bootstrap to generate a CI for the mean bill length of the male Adelie penguins
The percentile CI uses percentiles of the bootstrap distribution, e.g. for a 95% CI the 2.5th and 97.5th percentiles
Interval: 39.8 – 41.24 mm
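The percentile method is just quantile() applied to the bootstrap statistics; a self-contained sketch using invented data in place of the male bill lengths:

```r
# percentile bootstrap CI for a mean, using invented data in place of
# the male Adelie bill lengths
set.seed(42)
x <- rnorm(20, mean = 40.5, sd = 1.7)

# bootstrap the sample mean 1,000 times (resampling with replacement)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))

# percentile 95% CI: the 2.5th and 97.5th percentiles of the bootstrap means
quantile(boot_means, probs = c(0.025, 0.975))
```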
Difference of means of 21 bootstrap samples of the Adelie penguin bill lengths
21 samples is too small to do much with
Difference of means of 1000 bootstrap samples of the Adelie penguin bill lengths
# A tibble: 1 × 2
lower_ci upper_ci
<dbl> <dbl>
1 1.98 4.33
Does the confidence interval include 0?
Theoretical CI for the mean bill length for the male penguins
Mean of the males — 40.54; SD of the males — 1.71
Standard error of mean — \(\frac{\hat{\sigma}}{\sqrt{n}} = \frac{1.71}{\sqrt{20}} = \frac{1.71}{4.47} = 0.38\)
Critical value of t for 95% test — 2.093
qt(0.025, df = 20 - 1, lower.tail = FALSE)

The 95% confidence interval is
\[\mathsf{CI} = \overline{y} \pm (2.093 \times \mathsf{SEM})\]
Lower CI limit — 39.74
Upper CI limit — 41.34
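Putting those pieces together in R reproduces the interval:

```r
# theoretical 95% CI for the mean male bill length, from the numbers above
ybar <- 40.54  # sample mean
s    <- 1.71   # sample standard deviation
n    <- 20

sem   <- s / sqrt(n)                                # standard error of the mean
tcrit <- qt(0.025, df = n - 1, lower.tail = FALSE)  # critical t value, ~2.093

ybar + c(-1, 1) * tcrit * sem  # 39.74, 41.34
```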
The theoretical 95% CI for the difference of means for the male and female Adelie penguins is: 1.99 – 4.38
The bootstrap 95% CI is: 1.98 – 4.33
Resampling-based inference is a general approach to inference in statistics
Theoretical approaches were developed in an era without computers
Resampling methods exploit fast computation
Fewer assumptions, but some assumptions remain
Theoretical approaches still have their place